Abstract

Designing a highly concurrent data structure is an important challenge that is not easy to meet. As we
show in this paper, even for a data structure as simple as a linked list used to implement the set type, the
most efficient algorithms known so far may reject correct concurrent schedules.

We propose a new algorithm based on a versioned try-lock that we show to achieve optimal concurrency: 
it only rejects concurrent schedules that violate correctness of the implemented type. We show empirically 
that reaching optimality does not induce a significant overhead. In fact, our implementation of the optimal 
algorithm outperforms both the Lazy Linked List and the Harris-Michael state-of-the-art algorithms. 


1 Introduction 

Multicore applications require highly concurrent data structures. Yet, the very notion of concurrency is vaguely 
defined, to say the least. What is meant by a “highly concurrent” data structure implementing a given high-level 
object type? Generally speaking, one could compare the concurrency of algorithms by running a game where an 
adversary decides on the schedules of shared memory accesses from different processes. At the end of the game, 
the more schedules the algorithm would accept without hampering high-level correctness, the more concurrent 
it would be. The algorithm that accepts all correct schedules would then be considered concurrency-optimal. 

To illustrate the difficulty of optimizing concurrency, let us consider one of the most concurrency-friendly 
data structures [TB]: the sorted linked list used to implement the integer set type. Since any modification on a 
linked list affects only a small number of contiguous list nodes, most update operations on the list could, in
principle, run concurrently without conflicts. For example, one of the most efficient concurrent list-based sets to
date, the Lazy Linked List [9], achieves high concurrency by holding locks on only two consecutive nodes when
updating, thus allowing modifications of non-contiguous nodes to be scheduled in any order. The Lazy Linked
List is known to outperform the Java variant [TB] of the CAS-based Harris-Michael algorithm under low
contention because all its traversals, be they for read-only operations or to find the nodes to be updated, are
wait-free, i.e., they ignore locks and logical deletion marks. As we show below, the Lazy Linked List implementation
is, however, not concurrency-optimal, which raises two questions: Does there exist a more concurrent list-based set
algorithm? And if so, does higher concurrency induce an overhead that precludes higher performance?

The concurrency limitation of the Lazy Linked List is caused by the locking strategy of its update operations:
both insert(v) and remove(v) traverse the structure until they find a node whose value is greater than or equal to v, at
which point they acquire locks on two consecutive nodes. Only then is the existence of the value v checked: if v
is found (resp., not found), then the insertion (resp., removal) releases the locks and returns without modifying
the structure. By modifying metadata during lock acquisition without necessarily modifying the structure
itself, the Lazy Linked List overconservatively rejects certain correct schedules.

To illustrate that the concurrency limitation of the Lazy Linked List may lead to poor scalability, consider
Figure 1, which depicts the performance of a 100-element Lazy Linked List under a workload of 10% updates
(insertions/removals) and 90% contains on a 64-core machine. The list is comparatively small, hence all
updates (even the failed insertions and removals) are likely to contend. We can see that when we increase
the number of threads beyond 40, the performance drops significantly. This observation unveils an interesting
desirable data structure property by which concurrent operations conflict on metadata only when they conflict
on data. Note that this property extends the original notions of DAP, which are trivially ensured by most
linked-list implementations simply because all their operations “access” the head node and, thus, are allowed
to conflict on the metadata.

Figure 1: The concurrency limitation of the Lazy Linked List based set leads to poor scalability with only 10%
updates, as operations potentially contend on metadata even when they do not modify the structure

Our main contribution is the Versioned List, the most concurrent (in fact, optimally concurrent) and the
most efficient list-based set algorithm to date. It exploits the logical deletion technique of Harris-Michael that
divides the removal of a node into a logical and a physical step, and the wait-free traversal of the Lazy Linked
List. In contrast to these techniques, it relies on a novel synchronization step inspired by transactional memory
(TM): an update operation uses a CAS to set a versioned try-lock, based on the recent StampedLock of Java
8, immediately after the validation of the node succeeds.^1 If acquiring the try-lock fails, then the operation
restarts.

We show that the resulting algorithm rejects a concurrent schedule only if otherwise the high-level correctness
of the implemented set type (linearizability) is violated. Our algorithm is thus provably concurrency-optimal:
no other correct list-based set algorithm can accept more schedules.

The evaluation of our Versioned List shows that achieving optimal concurrency does not necessitate a costly
overhead. Extensive experiments on two 64-way multi-core architectures (x86-64 and SPARC) confirmed that
the Versioned List outperforms the state-of-the-art algorithms. In particular, as our algorithm differs
from the Lazy Linked List by validating before locking, it outperforms the Lazy Linked List by
3.5× for 64 threads on the workload of Figure 1. In addition, as our algorithm differs from Harris-Michael by
avoiding metadata accesses during traversals, it outperforms the Java variant of Harris-Michael (even with
its RTTI optimization) by up to 2.2× on read-only workloads.

In the rest of the paper, we describe our system model (Section 2), and present the Versioned List and prove
it correct (Section 3). We show it concurrency-optimal, as opposed to previous work (Section 4). We then evaluate
its performance, discuss the related work, and conclude in the subsequent sections. The
sequential specification of the set type and the missing proofs are deferred to the appendices.


2 Preliminaries 


Objects and implementations. We consider a standard asynchronous shared-memory system, in which 
n > 1 processes p_1, ..., p_n communicate by applying operations on shared objects. An object is an instance of
an abstract data type that specifies the set of operations the object exports, the set of responses the operations 
return, the set of states the object can take, and the sequential specification that stipulates the object’s correct 
sequential behavior. To implement a high-level object from a set of shared base objects, processes follow an 
algorithm, which is a collection of deterministic state machines, one for each process. The algorithm assigns 
initial values to the base objects and processes and specifies the base-object operations a process must perform 
when it executes every given operation. To avoid confusion, we call operations on the base objects primitives. 
A primitive is an atomic read-modify-write (rmw) on a base object, characterized by a pair of deterministic
functions (g, h): given the current state of the base object, g is an update function that computes its state after
the primitive is applied, while h is a response function that specifies the outcome of the primitive returned to
the process. Special cases of rmw primitives are read (g leaves the state unchanged and h returns the state)
and write (g updates the state with its argument and h returns ok).

^1 The possibility of “pre-locking validation” was suggested in [9], but to the best of our knowledge, no algorithm was proposed
to implement it.
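The (g, h) formalism can be made concrete with a small sketch. This is only an illustration under our own assumptions, not part of the model: the class name RmwBaseObject, the integer state, and the use of a Java monitor to obtain atomicity are all hypothetical choices.

```java
import java.util.function.Function;
import java.util.function.UnaryOperator;

// Hypothetical model of a base object with an integer state.
final class RmwBaseObject {
    private int state;

    RmwBaseObject(int initial) { this.state = initial; }

    // Applies the primitive (g, h) atomically: h computes the response from
    // the current state, then g computes the state after the primitive.
    synchronized <R> R apply(UnaryOperator<Integer> g, Function<Integer, R> h) {
        R response = h.apply(state);
        state = g.apply(state);
        return response;
    }

    // read: g leaves the state unchanged and h returns the state.
    int read() {
        return this.<Integer>apply(s -> s, s -> s);
    }

    // write: g replaces the state with the argument and h returns ok.
    String write(int value) {
        return this.<String>apply(s -> value, s -> "ok");
    }

    // CAS as an rmw primitive: g swaps only when the state matches,
    // h reports whether it matched.
    boolean compareAndSet(int expected, int update) {
        return this.<Boolean>apply(s -> s == expected ? update : s,
                                   s -> s == expected);
    }
}
```

Expressing CAS through the same apply method shows why read, write and CAS can all be treated uniformly as rmw events in the definitions that follow.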

Executions. An event of a process p_i is an invocation or response of an operation performed by p_i on a
high-level object implementation, or a rmw primitive (g, h) applied by p_i to a base object b along with its response r
(we call it a rmw event and write (b, (g, h), r, i)). A configuration specifies the value of each base object and the
state of each process. The initial configuration is the configuration in which all base objects have their initial
values and all processes are in their initial states.

An execution fragment is a (finite or infinite) sequence of events. An execution of an implementation I is an
execution fragment where, starting from the initial configuration, each event is issued according to I and each
response of a rmw event (b, (g, h), r, i) matches the state of b resulting from all preceding events. We assume
that executions are well-formed: no process invokes a new high-level operation before the previous high-level
operation returns.

Let α|p_i denote the subsequence of an execution α restricted to the events of process p_i. Executions α and
α' are equivalent if for every process p_i, α|p_i = α'|p_i. An operation π precedes another operation π' in an
execution α, denoted π →_α π', if the response of π occurs before the invocation of π' in α. Two operations
are concurrent if neither precedes the other. An execution is sequential if it has no concurrent operations. An
operation is complete in α if the invocation event is followed by a matching response; otherwise, it is incomplete
in α. Execution α is complete if every operation is complete in α.

High-level histories and linearizability. A high-level history H of an execution α is the subsequence of α
consisting of all invocations and responses of (high-level) operations.

A complete high-level history H is linearizable with respect to an object type τ if there exists a sequential
high-level history S equivalent to H such that (1) →_H ⊆ →_S and (2) S is consistent with the sequential
specification of type τ.

Now a high-level history H is linearizable if it can be completed (by adding matching responses to a subset
of incomplete operations in H and removing the rest) to a linearizable high-level history.

Sequential implementations. A sequential implementation of an object type τ specifies, for each operation
of τ, a deterministic procedure that performs read and write primitives on a collection of base objects that
encode the state of the object, and returns a response, so that the specification of τ is respected in all sequential
executions.

Consider the conventional set type exporting insert, remove, and contains operations with standard sequential
semantics: insert(v) adds v to the set and returns true if v is not already there, and returns false otherwise;
remove(v) drops v from the set and returns true if v is there, and returns false otherwise; and contains(v) returns
true if and only if v is in the set. The exact specification and the list-based sequential implementation of set
are presented in the appendix.
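As a point of reference, the following is a minimal sketch of a sequential sorted-list set with these semantics. It is our own illustration rather than the appendix algorithm; the class name SequentialListSet is hypothetical, and we assume that Integer.MIN_VALUE and Integer.MAX_VALUE play the role of the sentinels, so stored values must lie strictly between them.

```java
// Sequential (single-threaded) sorted linked list implementing the set type.
final class SequentialListSet {
    private static final class Node {
        final int val;
        Node next;
        Node(int val, Node next) { this.val = val; this.next = next; }
    }

    // head and tail are sentinels; real values lie strictly between them.
    private final Node head =
        new Node(Integer.MIN_VALUE, new Node(Integer.MAX_VALUE, null));

    // Returns true iff v was absent and has been added.
    boolean insert(int v) {
        Node prev = head, curr = head.next;
        while (curr.val < v) { prev = curr; curr = curr.next; }
        if (curr.val == v) return false;      // v already in the set
        prev.next = new Node(v, curr);        // single write to a next field
        return true;
    }

    // Returns true iff v was present and has been dropped.
    boolean remove(int v) {
        Node prev = head, curr = head.next;
        while (curr.val < v) { prev = curr; curr = curr.next; }
        if (curr.val != v) return false;      // v not in the set
        prev.next = curr.next;                // single write to a next field
        return true;
    }

    // Returns true iff v is in the set.
    boolean contains(int v) {
        Node curr = head;
        while (curr.val < v) curr = curr.next;
        return curr.val == v;
    }
}
```

Each update performs exactly one write to a next field, which is the property the concurrent algorithms exploit.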


3 The Versioned List Set 

In this section, we describe our Versioned List implementation of the set (Algorithm 1) and prove it linearizable.

List nodes. Each node in the list has 4 fields, as depicted in Lines 2-8: val stores the value of the node, next
is a pointer/reference to the next node, deleted is a boolean marker initialized to false, which indicates whether
a node has been removed from the list, and vlock is the versioned try-lock described below.

The versioned try-lock. The versioned try-lock is defined as a pair (ver, lock) of a version number and a
boolean locking state that can be modified together atomically. The versioned try-lock (Algorithm 2) supports
the following operations:

- getVersion(): returns the current version.

- tryLockAtVersion(ver): tries to change the locking state from false to true atomically if the current version
number matches ver. The method fails and returns false if either the node is already locked or the version
does not match; otherwise it succeeds and returns true.

- lockAtCurrentVersion(): spins until it acquires the lock with the current version.

- unlockAndIncrementVersion(): only called by the process that previously acquired the lock via a successful
tryLockAtVersion(ver) or lockAtCurrentVersion(). It unconditionally sets the locking state from true to false
and increments the current version number atomically, from (ver, true) to (ver+1, false).

     1: Shared variables:
     2:   node is a record with fields:
     3:     val, its value
     4:     next, its reference to the next node in the list
     5:     deleted, a boolean indicating whether the node is
     6:       logically deleted
     7:     vlock, a versioned lock: counter whose least significant
     8:       bit is a lock
     9:   Initially the list contains only two nodes head, tail,
    10:     head.val = −∞, tail.val = +∞
    11:     head.next = tail
    12:     head.deleted = tail.deleted = false
    13:     head.vlock = tail.vlock = (0, false)

    14: contains(v):                                  ▷ wait-free contains
    15:   curr ← head
    16:   while curr.val < v do
    17:     curr ← curr.next
    18:   return (curr.val = v ∧ ¬curr.deleted)

    19: validate(v, prev):
    20:   pVer ← prev.vlock.getVersion()              ▷ return lock version
    21:   if prev.deleted then return ⊥               ▷ full abort
    22:   curr ← prev.next
    23:   while curr.val < v do
    24:     pVer ← curr.vlock.getVersion()
    25:     if curr.deleted then goto line 20         ▷ partial abort
    26:     prev ← curr                               ▷ the above line checks prev.deleted
    27:     curr ← curr.next                          ▷ same as reading prev.next
    28:   return (prev, pVer, curr)

    29: waitfreeTraversal(v):                         ▷ wait-free traversal used in updates
    30:   prev ← curr ← head
    31:   while (curr.val) < v do                     ▷ until position is reached
    32:     prev ← curr                               ▷ keep track of the previous node
    33:     curr ← curr.next
    34:   return prev

    35: insert(v):
    36:   prev ← waitfreeTraversal(v)
    37:   if ((prev, pVer, curr) ← validate(v, prev)) = ⊥ then
    38:     goto line 36                              ▷ full abort: restart from beginning
    39:   if curr.deleted then goto line 37
    40:   if curr.val = v then return false           ▷ v already in the set
    41:   newNode.val ← v                             ▷ allocate a new node with value v
    42:   newNode.next ← curr
    43:   if ¬prev.vlock.tryLockAtVersion(pVer) then  ▷ versioned try-lock
    44:     goto line 37                              ▷ partial abort
    45:   prev.next ← newNode
    46:   prev.vlock.unlockAndIncrementVersion()
    47:   return true

    48: remove(v):
    49:   prev ← waitfreeTraversal(v)
    50:   if ((prev, pVer, curr) ← validate(v, prev)) = ⊥ then
    51:     goto line 49                              ▷ full abort: restart from beginning
    52:   if (curr.val ≠ v ∨ curr.deleted) then return false
    53:   if ¬prev.vlock.tryLockAtVersion(pVer) then  ▷ versioned try-lock
    54:     goto line 50                              ▷ partial abort
    55:   curr.vlock.lockAtCurrentVersion()           ▷ spin lock
    56:   curr.deleted ← true                         ▷ logical delete
    57:   prev.next ← curr.next                       ▷ physical delete
    58:   curr.vlock.unlockAndIncrementVersion()
    59:   prev.vlock.unlockAndIncrementVersion()
    60:   return true

Algorithm 1: The versioned list-based set

In our Java 8 implementation of the versioned try-lock (Algorithm 2), we tested both a single AtomicInteger
variable, which supports single-word CAS, and the more recent StampedLock.^2 One could alternately
use the least significant bit of an integer on x86 architectures to represent the locking state, where 0 means
“unlocked” and 1 means “locked”, hence losing portability. Distinct version numbers are represented by all the
even values: to extract the version we use a bit-mask that always sets the last bit to 0 when doing a bitwise
and.
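The encoding described above can be sketched in a few lines of Java on top of AtomicInteger. This is a minimal illustration in the spirit of Algorithm 2, not the authors' code: the class name VersionedTryLock is our own, and using incrementAndGet for the release is an assumption that is safe only because the lock holder is the sole writer while the lock is held.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Versioned try-lock packed into one integer:
// an even value is an unlocked version, the next odd value is that version locked.
final class VersionedTryLock {
    private final AtomicInteger lockValue = new AtomicInteger(0);

    // Returns the current version with the lock bit masked out (always even).
    int getVersion() {
        return lockValue.get() & ~1;
    }

    // Succeeds only if the lock is free and still at version ver (ver is even).
    boolean tryLockAtVersion(int ver) {
        return lockValue.compareAndSet(ver, ver + 1);
    }

    // Spins until the lock is acquired at whatever the latest version is.
    void lockAtCurrentVersion() {
        while (!tryLockAtVersion(getVersion())) {
            Thread.yield();
        }
    }

    // Must be called by the lock holder only: odd value -> next even value.
    void unlockAndIncrementVersion() {
        lockValue.incrementAndGet();
    }
}
```

A caller that read version ver before validating its data can later call tryLockAtVersion(ver); the call fails if the lock is held or if any lock/unlock cycle happened in between.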

We now describe our list-based set implementation. Recall that the set type exports operations insert(v),
remove(v) and contains(v), with v ∈ ℤ (see the appendix for a detailed specification). The list is initialized with
2 nodes: head (storing the minimum sentinel value) and tail (storing the maximum sentinel value), with head.next storing
the pointer to tail.

Contains. The algorithm for contains simply traverses the list in a wait-free manner, exactly as in the sequential
algorithm (see Appendix A), except that in the end, we also check that curr is not deleted (Line 18).

^2 https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/locks/StampedLock.html
     1: Private field:
     2:   lockValue, an integer that supports CAS
     3:   Initially lockValue = 0

     4: getVersion():                              ▷ only return even value
     5:   return lockValue.read() & 1111...1110    ▷ bitwise and

     6: tryLockAtVersion(ver):                     ▷ assuming ver is even
     7:   return lockValue.CAS(ver, ver + 1)       ▷ next odd value

     8: lockAtCurrentVersion():                    ▷ spin lock on the latest version
     9:   success ← false
    10:   while (¬success) do                      ▷ spin until we get the lock
    11:     ver ← getVersion()
    12:     success ← tryLockAtVersion(ver)

    13: unlockAndIncrementVersion():               ▷ assuming locked (odd val)
    14:   val ← lockValue.read()
    15:   lockValue.CAS(val, val + 1)              ▷ use atomicSet if available

Algorithm 2: The CAS-based implementation for the versioned try-lock


This wait-free traversal, introduced by Heller et al. [3], results in a highly efficient contains algorithm, as its 
only overhead (compared to the sequential implementation) comes from a single memory read, on curr.deleted. 


Pre-locking validation in insert and remove. For insert and remove, we first traverse the list in a wait-free
manner (Lines 29-34) until we find the position where a node might be inserted or deleted, i.e., where
prev.val < v and curr.val ≥ v. Then we use the novel technique of pre-locking validation, which validates the state
of the nodes prior to locking. Generally, an optimistic lock-based algorithm follows this pattern for updates:

    1: read data on node
    2: lock node
    3: re-read & validate integrity: if fail then unlock & restart
    4: modify data
    5: unlock node

With pre-locking validation, the new pattern becomes:

    1: read node version ver
    2: read & validate data integrity: if fail then restart
    3: try-lock-at-version(ver): if fail then restart
    4: modify data
    5: unlock-and-increment-version


The reason why we can validate before acquiring the lock is that the consistency of the validation result is 
protected by the version number. Observe that in this new pattern, any modification to the node first acquires 
a lock and the version is changed only when releasing the lock. Thus, in case any concurrent thread is modifying 
or has modified the node between the read-version and the try-lock-at-version steps, the try-lock will fail either 
because of a lock conflict or a new version number. 
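The five-step pattern can be sketched with the standard StampedLock, whose tryOptimisticRead and tryConvertToWriteLock methods play the roles of "read version" and "try-lock-at-version". The example below is our own illustration on a deliberately simple object, a bounded counter; the class name PreLockValidatedCounter and its methods are hypothetical and do not appear in the paper.

```java
import java.util.concurrent.locks.StampedLock;

// A counter that may not exceed a limit, updated with pre-locking validation.
final class PreLockValidatedCounter {
    private final StampedLock lock = new StampedLock();
    private final int limit;
    private volatile int value;    // written only while the write lock is held

    PreLockValidatedCounter(int limit) { this.limit = limit; }

    int get() { return value; }

    // Increments the counter and returns true unless the limit is reached.
    boolean incrementIfBelowLimit() {
        while (true) {
            long ver = lock.tryOptimisticRead();        // 1: read version (0 if locked)
            int seen = value;                           // 2: read data ...
            if (seen >= limit) {                        //    ... and validate it
                if (lock.validate(ver)) return false;   // failed update: no lock taken
                continue;                               // read raced with a writer: restart
            }
            long ws = lock.tryConvertToWriteLock(ver);  // 3: try-lock at version
            if (ws == 0L) continue;                     //    locked or version changed: restart
            try {
                value = seen + 1;                       // 4: modify data
                return true;
            } finally {
                lock.unlockWrite(ws);                   // 5: unlock, which advances the version
            }
        }
    }
}
```

Note that an update that turns out to be unnecessary (the limit is already reached) never acquires the lock, which mirrors how failed insertions and removals avoid touching metadata in the Versioned List.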


The validate function. The validate function (Lines 20-28) invoked by our update operations is a short
traversal that stores the version of prev, then checks that prev is not logically deleted and finally sets curr to
prev.next. The validation conditions are (1) prev is not deleted (prev.deleted = false), and (2) prev.next points
to curr (prev.next = curr).

Note that after the traversal completes and before the validation starts, some new node could be inserted 
between prev and curr or curr could be deleted. Instead of using the curr node from the traversal to check 
whether prev.next = curr we simply re-traverse from the prev node. 

During validation and locking, we might fail due to conflicts with a concurrent operation, in which case 
we need to abort and restart the operation. We included an optimization of partial abort where instead of 
restarting from head, we only need to restart from prev under the condition that prev.deleted is not true (we 
already know that prev.val < v). As a result of this “versioned-traversal”, we get prev and curr together with 
prev's version pVer (Line 28): if the operation later successfully locks prev at pVer, we are sure that prev is 
not deleted and prev.next = curr. Finally, after the validation, we check curr.val to see if the value we are 
trying to insert or remove is present in (or absent from) the set. 

Inserting a node. For insert, we create the new node before entering the critical section (Line 41): the reason
is that we want to optimize concurrency and minimize the length of the critical section. However, our
implementation may execute the node creation (Line 41) multiple times since we could potentially
abort later and restart. A possible optimization is to keep track of the newNode reference and make a check
before Line 41 so that we allocate the memory at most once per insert operation. We enter the critical section
by trying to lock prev at version pVer atomically in Line 43 (pVer is the version of the prev node obtained
after validation in Line 37). If we obtain the lock successfully, it implies that we are already in a valid state
(conditions (1) and (2) are satisfied). Once we have successfully acquired the lock on prev, we link in the new
node (Line 45).


Removing a node. For remove, we also require the lock on curr, which is obtained using the spin-lock in
Line 55. We need to lock prev at version pVer for the same reason as in insert: to make sure no concurrent
thread is inserting a node between prev and curr or deleting the curr node. Additionally, we need to lock curr
at its current version to prevent concurrent threads from inserting/deleting the node after curr. Removing a
node now involves two steps: a logical delete that sets curr.deleted to true (Line 56) and a physical delete that
changes prev.next (Line 57).

Finally, we exit the critical section by releasing the locks on the node(s) involved and incrementing their versions in
one atomic step (Lines 46, 58 and 59), and return true for the operation.
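Putting the pieces together, the following Java sketch transcribes Algorithm 1 as we read it. It is an illustration rather than the evaluated implementation: the class names VersionedListSet and Window are our own, the lock is inlined in each node as an AtomicInteger (even value = unlocked version, odd = locked), and values are assumed to lie strictly between the Integer.MIN_VALUE and Integer.MAX_VALUE sentinels.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class VersionedListSet {
    private static final class Node {
        final int val;
        volatile Node next;
        volatile boolean deleted;
        final AtomicInteger vlock = new AtomicInteger(0);
        Node(int val, Node next) { this.val = val; this.next = next; }
        int version() { return vlock.get() & ~1; }
        boolean tryLockAt(int ver) { return vlock.compareAndSet(ver, ver + 1); }
        void spinLock() { while (!tryLockAt(version())) Thread.yield(); }
        void unlock() { vlock.incrementAndGet(); }   // also advances the version
    }

    // Result of validate: prev, the version read on prev, and prev's successor.
    private static final class Window {
        final Node prev; final int pVer; final Node curr;
        Window(Node prev, int pVer, Node curr) {
            this.prev = prev; this.pVer = pVer; this.curr = curr;
        }
    }

    private final Node head =
        new Node(Integer.MIN_VALUE, new Node(Integer.MAX_VALUE, null));

    boolean contains(int v) {                        // Lines 14-18
        Node curr = head;
        while (curr.val < v) curr = curr.next;
        return curr.val == v && !curr.deleted;
    }

    private Node traverse(int v) {                   // Lines 29-34
        Node prev = head, curr = head;
        while (curr.val < v) { prev = curr; curr = curr.next; }
        return prev;
    }

    private Window validate(int v, Node prev) {      // Lines 19-28
        while (true) {                               // each iteration is a "goto line 20"
            int pVer = prev.version();
            if (prev.deleted) return null;           // full abort
            Node curr = prev.next;
            boolean partialAbort = false;
            while (curr.val < v) {
                pVer = curr.version();
                if (curr.deleted) { partialAbort = true; break; }
                prev = curr;
                curr = curr.next;
            }
            if (!partialAbort) return new Window(prev, pVer, curr);
        }
    }

    boolean insert(int v) {                          // Lines 35-47
        Node start = traverse(v);
        while (true) {
            Window w = validate(v, start);
            if (w == null) { start = traverse(v); continue; }        // full abort
            if (w.curr.deleted) { start = w.prev; continue; }
            if (w.curr.val == v) return false;                       // already present
            Node newNode = new Node(v, w.curr);
            if (!w.prev.tryLockAt(w.pVer)) { start = w.prev; continue; } // partial abort
            w.prev.next = newNode;
            w.prev.unlock();
            return true;
        }
    }

    boolean remove(int v) {                          // Lines 48-60
        Node start = traverse(v);
        while (true) {
            Window w = validate(v, start);
            if (w == null) { start = traverse(v); continue; }        // full abort
            if (w.curr.val != v || w.curr.deleted) return false;
            if (!w.prev.tryLockAt(w.pVer)) { start = w.prev; continue; } // partial abort
            w.curr.spinLock();
            w.curr.deleted = true;                   // logical delete
            w.prev.next = w.curr.next;               // physical delete
            w.curr.unlock();
            w.prev.unlock();
            return true;
        }
    }
}
```

Declaring next and deleted as volatile is our choice to make the wait-free reads well defined under the Java memory model; the pseudocode itself does not prescribe it.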


Progress. It is easy to see that the contains operation is wait-free: a matching response is returned within a 
finite number of its events. The update operations ensure deadlock-freedom: assuming no process fails in the 
critical section, some process makes progress by completing each of its operations. 


Proof of linearizability. We now show that the Versioned List algorithm is linearizable with respect to the
set type. Let α be a finite execution of Algorithm 1 and <_α denote the total order on events in α. For the
sake of the proof, we assume that α starts with an artificial sequential execution of an insert operation π_0 that
inserts tail and sets head.next = tail. Let H be the high-level history exported by α.


Completions. We obtain a completion H̄ of H as follows. The invocation of an incomplete contains operation is
discarded. The invocation of an incomplete π = remove operation that has not performed the write in Line 56 is
discarded; otherwise, it is completed with response true. The invocation of an incomplete π = insert operation
that has not performed the write in Line 45 is discarded; otherwise, it is completed with response true.


Linearization points. We obtain a sequential high-level history S equivalent to H̄ by associating a linearization
point with each operation π as follows.

For every π = insert(v) that returns true in H̄, the linearization point is associated with the write event in Line 45
(rendering the node that stores v reachable from the head); otherwise it is associated with the last read of a node’s next
field performed by π in α.

For every π = remove(v) that returns true in H̄, the linearization point is associated with the write event in Line 56
(setting the deleted flag of a list’s element); otherwise it is associated with the last read of a node’s next field
performed by π in α.

For π = contains(v) that returns true, the linearization point is associated with the last read performed by π in which
π finds the deleted field of a reachable node storing v to be false (Line 18).

For π = contains(v) that returns false in H̄, [...] by π. Then, the linearization point is chosen to be the first event
performed by π_1 immediately after the write to X.deleted, but prior to the read of X.deleted by π. Otherwise, if no
such event of π_1 exists, then the linearization point is the read of X.deleted by π.


Since linearization points are chosen within the intervals of operations performed in α, for any two operations
π_i and π_j in H̄, if π_i →_H̄ π_j, then π_i →_S π_j. Intuitively, the linearization point of each insert (resp., remove)
operation determines the instant when the operation takes effect, i.e., the corresponding element becomes
reachable (resp., unreachable). A successful contains(v) operation is linearized at the moment an “undeleted”
list element storing v has been reached. A failed contains operation is linearized at the moment it detects that
no “undeleted” node storing v can be reached.


[Figure 2 timeline: insert(1) performs R(h), R(t), new(X1), W(h) and returns true; insert(2) performs R(h), R(t),
new(X2), W(h) and returns true; insert(2) overwrites the effect of insert(1) in a “lost update”.]

Figure 2: A history exporting an observably incorrect schedule σ; for succinctness, R(h) and R(t) refer to
reads of both val and next fields; W(h) refers to the write on head.next


[Figure 3 timeline: insert(2) performs R(h) and R(X1) and is incomplete; it holds the lock on X1 after E'.
insert(1) performs R(h) and returns false; it must acquire the lock on X1 prior to returning false in E.]

Figure 3: A schedule rejected by the Lazy Linked List; the initial state of the list is {X1}, which stores value 1; R(X1)
refers to reads of both the val and next fields; new(X2) creates a new node storing value 2


Thus, we can prove the linearizability of the Versioned List with respect to the set type (the proof is given in
the appendix):

Theorem 1. Versioned List is linearizable with respect to the set type.


4 Concurrency analysis 

To characterize the ability of a concurrent implementation to process arbitrary interleavings of sequential
code, we introduce the notion of a schedule. Intuitively, a schedule of an execution of a list-based set algorithm
specifies the order in which high-level operations access the nodes of the list. List-based set algorithms generally
follow the sequential implementation (denoted LL) of operations insert, remove and contains: every high-level
operation reads the nodes sequentially until the desired fragment of the list is located. The update operation (insert
or remove) then writes to the next field of one of the nodes the address of a new node (if it is insert) or the
address of the node that follows the removed node in the list (if it is remove). (The sequential write can be
implemented using a CAS primitive [5].) For the detailed pseudocode of the sequential implementation followed
by the concurrent linked-list implementations, we refer to the algorithm given in the appendix.


4.1 Schedules and local serializability 

An execution of our concurrent implementation involves reading and writing to the nodes’ fields val and next,
as well as reading and modifying meta-fields such as deleted and vlock (cf. Section 3). Naturally, we identify
the events in the execution of the concurrent implementation corresponding to the “sequential” reads, writes
(of val and next fields) and node creation events (in the sequential implementation LL) as marked
explicitly.

Let α be an execution of our concurrent implementation. We define the history of an execution α as the
subsequence of α corresponding to the events that “take effect”. Formally, for every update operation π in α,
H|π is defined to be the subsequence of α|π consisting of the reads, writes and node creation events from the last
invocation of the function waitfreeTraversal by π in Line 36 or 49. For every contains operation π in α, H|π
is defined to be the subsequence of α|π consisting of the reads and writes on a node’s val and next fields.

Intuitively, a schedule corresponds to some interleaving of the sequential reads, writes, node creation events 
and invocation and responses of high-level operations performed in the sequential implementation LL. Formally, 
a schedule is an equivalence class of histories that agree on the order of reads, writes, node creation events and 
high-level operations, but not necessarily on the responses of high-level operations and read events. Observe
that, in our concurrent implementation, every read operation (on a base object x) returns the argument of the 
latest preceding write (on x). Thus, for every history, there exists exactly one schedule. 


Definition 1. We say that a schedule σ is locally serializable (with respect to the sequential implementation
of list-based set LL) if for each of its operations π, there exists a history S of LL such that σ|π = S|π.

Definition 2. We say that a schedule is correct if it is (1) linearizable (with respect to the set type) and (2) locally
serializable (with respect to LL).


Theorem 2 (Correctness). The Versioned List implementation accepts only correct schedules. 

Proof. Take any schedule σ of Algorithm 1. Theorem 1 implies that the high-level history of σ is linearizable
with respect to the set type.

To show local serializability, we first remark that every operation traverses the list starting from the head
node and reads the next field of a node to locate the subsequent node. Before adding a new node to the list
(Line 45), each insert operation initializes the node’s val and next fields, so that at all times the next field of
a node stores a pointer to an inserted node with a strictly higher value or to the tail node. Furthermore, since the
values stored in the list are integers, every operation invoked with parameter v eventually locates the node
storing v or a higher value. Thus, every sequence of non-aborted events of every operation π is finite. Hence,
there exists a sequence of insert operations S_0 such that S_0 · σ|π is a sequential history of LL. □


4.2 Optimal concurrency 


We show that any finite schedule rejected by our algorithm is not observably correct. A correct schedule σ is
observably correct if by completing update operations in σ and extending, for any v ∈ ℤ, the resulting schedule
with a complete sequential execution of contains(v), applied to the resulting contents of the list, we obtain a
correct schedule. Here the contents of the list after a given correct schedule are determined based on the order
of its write operations. For each node, we define the resulting state of its next field based on the last write
in the schedule. Since in a correct schedule each new node is first created and then linked to the list, we can
reconstruct the state of the list by iteratively traversing it, starting from head.

Intuitively, a schedule is observably correct if it incurs no “lost updates”. Consider, for example, a schedule
(cf. Figure 2) in which two operations, insert(1) and insert(2), are applied to the initially empty set. Imagine that
they first both read head, then both read tail and then both perform writes on head.next. The resulting
schedule is trivially correct (both operations return true, so the schedule can come from a complete linearizable
history). However, in the schedule, one of the operations, say insert(1), overwrites the effect of the other one.
Thus, if we extend the schedule with a complete execution of contains(2), the only possible response it may
give is false, which obviously does not produce a linearizable high-level history.
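The lost update can be replayed deterministically. The sketch below is our own single-threaded illustration: it executes the reads and writes of the two unsynchronized inserts in the interleaved order described above, with insert(1) writing last, and then runs contains on the result. The class name LostUpdateDemo and its methods are hypothetical.

```java
// Single-threaded replay of the interleaving: both inserts read head and tail,
// then both write head.next; the second write erases the first.
final class LostUpdateDemo {
    static final class Node {
        final int val;
        Node next;
        Node(int val, Node next) { this.val = val; this.next = next; }
    }

    static boolean contains(Node head, int v) {
        Node curr = head;
        while (curr.val < v) curr = curr.next;
        return curr.val == v;
    }

    // Returns { contains(1), contains(2) } after replaying the schedule.
    static boolean[] replay() {
        Node tail = new Node(Integer.MAX_VALUE, null);
        Node head = new Node(Integer.MIN_VALUE, tail);

        Node succSeenBy1 = head.next;         // insert(1): R(h), R(t)
        Node succSeenBy2 = head.next;         // insert(2): R(h), R(t)
        Node x1 = new Node(1, succSeenBy1);   // insert(1): new(X1)
        Node x2 = new Node(2, succSeenBy2);   // insert(2): new(X2)
        head.next = x2;                       // insert(2): W(h), returns true
        head.next = x1;                       // insert(1): W(h), returns true, unlinking X2

        return new boolean[] { contains(head, 1), contains(head, 2) };
    }

    public static void main(String[] args) {
        boolean[] r = replay();
        System.out.println("contains(1) = " + r[0] + ", contains(2) = " + r[1]);
    }
}
```

Both inserts reported success, yet the subsequent contains(2) cannot find the value, which is exactly the situation the definition of observable correctness rules out.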

Theorem 3 (Optimality). Versioned List accepts all observably correct schedules. 


Proof. Let $\sigma$ be any schedule of our concurrent implementation. Recall that, for every update operation $\pi$ in 
$\sigma$, $\sigma|\pi$ is defined to be the subsequence of $\sigma$ consisting of the reads, writes and node creation events from the 
last invocation of the function waitfree-traversal by $\pi$ in Lines 36 and 49, and for every $\pi$ = contains in $\sigma$, $\sigma|\pi$ 
is the subsequence of $\sigma$ consisting of the reads and writes on a node’s val and next fields. 

We prove that any schedule rejected by our algorithm is not observably correct. More precisely, we show 
that an operation restarts a fragment of its execution (in Lines 44 and 54) only if extending it with a read or a write 
on next or val fields would result in a schedule that is not observably correct. 

We first observe that if a node is logically deleted (Line 56), then its next write renders the node unreachable 
from the head node (Line 57). Thus, an update operation $\pi$ partially restarts because of reading a logically 
deleted node (Lines 38 or 51) only if it is concurrent with a remove operation which, when completed, would 
physically remove the node addressed by $\pi$ at the end of its traversal. It is easy to see that regardless of what 
this operation $\pi$ is (insert($v$) or remove($v$)), if we complete it in turn and then extend the resulting schedule 
with contains($v$), the effect of $\pi$ will not be seen and the schedule will not be linearizable. 

Similarly, an update operation $\pi$ partially restarts in Lines 44 and 54, after finishing its traversal phase, if it 
fails to grab a lock on one of the nodes it is about to modify. Thus, $\pi$ is concurrent with another update 
operating on the same node. Again, by completing both $\pi$ and the concurrent update, we obtain a schedule in 
which one of the updates is “lost”, so that its extension with some contains($v$) will not be linearizable. □ 


Theorem 3 shows that our implementation only rejects concurrent schedules that would result in a violation 
of linearizability. On the other hand, we can easily describe observably correct schedules that are rejected by 
the Lazy Linked List and Harris-Michael linked list implementations. 


The Lazy Linked List. Our example illustrates how the post-locking validation strategy employed by the Lazy 
Linked List makes it sub-optimal with respect to concurrency. As explained in the introduction, the insert operation of the 














[Figure 4: timeline of insert(1), remove(2), insert(4) and insert(3). Annotations in the diagram: the CAS on head 
by insert(1) succeeds; the CAS on head by remove(2) fails; $X_2$ is logically deleted but still reachable from head 
after $E$; insert(4) fails to return false because its attempted CAS on $X_1$ fails and the operation must restart; 
the complete execution of insert(3) performs the physical deletion of $X_2$.] 

Figure 4: A schedule rejected by the Harris-Michael linked list; initial state of the list is $\{X_2, X_3, X_4\}$; each 
$X_i$ stores value $i$; $R(X)$ refers to reads of both the val and next fields; $W(h)$ is a CAS that attempts to set 
head.next to the desired node if it has not changed since the previous read 


Operation  Algorithm         Number of concurrent threads
segment                            1      2      4      8     16     24     32     40     48     56     64     72

Traversal  Lazy linked list      2.9    8.9   16.4   30.2   58.5   65.5   61.0   61.0   71.7  100.4   75.4   68.9
           Harris-Michael        5.2    9.9   15.9   29.2   59.0   91.4  127.9  153.3  177.1  203.2  228.9  252.8
           Versioned list        3.7    7.2   12.3   22.8   42.3   61.5   79.5   89.2   96.0  109.6  122.2  202.9

Update     Lazy linked list      3.7   11.6   17.1   26.7   88.8  211.1  503.0 1019.5 1355.7 1814.6 2163.9 2634.2
           Harris-Michael        1.0    3.8    4.8    5.4    6.4    7.0    8.3    9.4   10.2   11.1   12.9   13.7
           Versioned list        2.3    3.8    4.5    5.4    6.4    7.3    8.0    9.2   10.9   12.9   16.2  162.6

Table 1: The relative time spent on list traversal and node update per operation on average using the benchmark 
with size 100 and update ratio 100% 


Lazy Linked List acquires the lock on the nodes it writes to, prior to the check of the node’s state. Consider the 
schedule depicted in the corresponding figure. The operation insert(2) traverses the list, reaches node $X_1$ storing value 1, acquires the lock on 
$X_1$ and creates a new node that stores value 2. Observe that, at this point in the execution, the implementation 
has not performed the write to $X_1$ (corresponding to the write in the sequential implementation LL in Line 13) 
and thus must hold the lock on $X_1$ after $E'$. However, insert(1) must also acquire the lock on $X_1$ prior to 
returning a matching response false. But it cannot do so until insert(2) releases the lock on $X_1$. Consequently, 
the Lazy Linked List cannot export the schedule depicted in the figure. 


The Harris-Michael linked list. In the Harris-Michael linked list (cf. [HI Chapter 9]), each update 
operation attempts to physically remove (using CAS) nodes that are marked for deletion as it traverses the 
list. If the attempt fails, the operation is restarted. Figure 4 depicts a schedule $\sigma$ that is rejected by the Harris- 
Michael algorithm. The initial state of the list is $\{X_2, X_3, X_4\}$, where each $X_i$ stores value $i$. First, insert(1) runs 
concurrently with remove(2): insert(1) performs a CAS on head to set head.next to $X_1$, after which 
remove(2) performs a logical deletion of $X_2$ (by setting a deleted flag) and then invokes CAS to set head.next 
to $X_3$. However, this CAS fails, and the operation returns true after having only logically deleted $X_2$. Thus, at 
the end of this execution, $X_2$ is still reachable from the head. We now extend this execution with an insert(4) 
that reads head and $X_1$; prior to its attempted physical removal of $X_2$, a concurrent insert(3) performs this 
physical removal, thus forcing insert(4) to restart. Therefore, the Harris-Michael implementation cannot accept $\sigma$ 
(which is clearly accepted by the Versioned List). 
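
The first part of this schedule can be replayed step by step. The sketch below is not the Harris-Michael implementation used in the evaluation: the class HarrisMichaelReplay and its Node type are hypothetical, the mark is kept in a java.util.concurrent.atomic.AtomicMarkableReference, and the interleaving of insert(1) and remove(2) is written out by hand in one thread.

```java
import java.util.concurrent.atomic.AtomicMarkableReference;

// Hypothetical replay of the prefix of the schedule: insert(1) || remove(2).
final class HarrisMichaelReplay {
    static final class Node {
        final int val;
        // Successor pointer paired with the "logically deleted" mark of this node.
        final AtomicMarkableReference<Node> next;
        Node(int val, Node succ) {
            this.val = val;
            this.next = new AtomicMarkableReference<>(succ, false);
        }
    }

    // Returns true if X2 ends up marked yet still reachable from head.
    static boolean replay() {
        Node tail = new Node(Integer.MAX_VALUE, null);
        Node x4 = new Node(4, tail);
        Node x3 = new Node(3, x4);
        Node x2 = new Node(2, x3);
        Node head = new Node(Integer.MIN_VALUE, x2);

        // Both operations read head and X2 before either of them writes.
        Node seenByInsert = head.next.getReference();
        Node seenByRemove = head.next.getReference();

        // insert(1): the CAS on head.next from X2 to the new node X1 succeeds.
        Node x1 = new Node(1, seenByInsert);
        boolean insertCas = head.next.compareAndSet(seenByInsert, x1, false, false);

        // remove(2): the logical deletion marks X2 ...
        boolean marked = x2.next.compareAndSet(x3, x3, false, true);
        // ... but the physical removal fails, since head.next now points to X1.
        boolean unlinkCas = head.next.compareAndSet(seenByRemove, x3, false, false);

        // Walk the list from head to check whether X2 is still reachable.
        boolean reachable = false;
        for (Node n = head.next.getReference(); n != null; n = n.next.getReference()) {
            if (n == x2) reachable = true;
        }
        return insertCas && marked && !unlinkCas && reachable && x2.next.isMarked();
    }
}
```

Any later traversal that passes through the marked node has to attempt the unlinking CAS itself, which is where insert(4) and insert(3) collide in the rest of the schedule.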


5 Experimental evaluation 

We compared our versioned list in Java to the lock-based Lazy Linked List [S] and Harris-Michael’s non-blocking 
list |5J[T5] with its wait-free and RTTI optimization suggested in Java by Heller et al. [3] using the Synchrobench 
benchmark suite [1]. For the versioned list, we tested both a hand-crafted versioned lock and one implemented 
on top of the Java 8 StampedLock and only report the better results of the latter. The source code is publicly 
available as part of Synchrobench at https://github.com/gramoli/synchrobench/tree/master/java/src/linkedlists 

Figure 5: The performance results obtained on the x86 architecture 

Figure 6: The performance results obtained on the SPARC architecture 


5.1 Setup 

We report the performance results of the three implementations obtained from two different architectures (x86 
and SPARC). More precisely, we performed the experiment on a 4 socket AMD Opteron 6378 2.4 GHz 16-core 
(64-core in total) running Linux Fedora 18 and on a Sun Niagara 2 running SunOS 5.1 with 8 cores each 
running 8 simultaneous multiple threads at 1.165 GHz (64 hardware threads in total). Both architectures run 
the 64-bit Java HotSpot server VM version 1.8 update 25. 

Synchrobench initializes the data structure by filling it up to a predefined size with values chosen randomly 
from the range $\{1, 2, \ldots, 2 \times \mathit{size}\}$. It spawns from 1 to 72 threads. Running for 10 seconds, each thread 
repeatedly chooses one of the three operations with a fixed probability distribution defined by the 
update ratio and executes it with an argument picked uniformly at random from the range. 

The rationale behind the workload choice is to keep the size constant in expectation during the benchmark 
execution. This is the case because both the insert and remove values are chosen from the range that is twice 
the initial size, which means both have a 50% chance of choosing a value already in the list and the same chance 
of choosing an absent value. Note that the expected number of effective inserts (inserting a value absent from the 
list) and effective removes (removing a value present in the list) is half the update ratio. 
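
The stability argument can be checked with a small single-threaded simulation. This is a sketch under assumptions, not Synchrobench code: the class WorkloadSketch is hypothetical, a java.util.TreeSet stands in for the list, the workload is update-only, and inserts and removes are assumed to be equally likely.

```java
import java.util.Random;
import java.util.TreeSet;

// Hypothetical stand-in for the update-only workload described above.
final class WorkloadSketch {
    // Runs `ops` updates on a set of initial size `size`;
    // returns { final size, number of effective updates }.
    static long[] run(int size, int ops, long seed) {
        Random rnd = new Random(seed);
        int range = 2 * size;                       // values are drawn from {1, ..., 2 * size}
        TreeSet<Integer> set = new TreeSet<>();
        while (set.size() < size) {
            set.add(1 + rnd.nextInt(range));        // initial fill up to the predefined size
        }

        long effective = 0;
        for (int i = 0; i < ops; i++) {
            int v = 1 + rnd.nextInt(range);
            // Assumption: an update is an insert or a remove with equal probability.
            boolean changed = rnd.nextBoolean() ? set.add(v) : set.remove(v);
            if (changed) effective++;               // the update actually modified the set
        }
        return new long[] { set.size(), effective };
    }

    public static void main(String[] args) {
        long[] r = run(100, 100_000, 42L);
        System.out.println("final size = " + r[0] + ", effective updates = " + r[1]);
    }
}
```

With an initial size of 100, the final size should stay close to 100 and roughly half of the updates should be effective, in line with the expectation stated above.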

Our experiments are done for list sizes in {100, 1000, 10000}, update ratios in {0%, 10%, 100%}, and number 
of concurrent threads in {1, 2, 4, 8, 16, 24, ..., 72}. The results presented here are the average of 10 runs of 10 
seconds for each point in the parameter space. 

5.2 Results and evaluation 

Figures 5 and 6 depict the number of operations per millisecond obtained on x86 and on SPARC, respectively. 
We only report the results for a list size of 100 on the SPARC architecture, since the curves for larger list sizes 
were similar for both architectures. 

The left column of both figures is the contains-only workload, the right column is the update-only (insert 
and remove only) workload, and the middle column is a more realistic workload with 90% contains and 10% 
updates. Note that the level of contention increases from bottom to top as the list size decreases (leading to a 
higher chance of concurrent threads accessing the same node) and left to right as the number of write operations 
increases. Also, since each operation has a 50% chance to return false, the effective updates are roughly half of 
the shown percentage. 

We also studied the contribution of traversals and updates to the execution time used by insert and remove 
in each algorithm. The traversal time is defined as the time between the operation invocation and the moment it finds 
prev and curr where prev.val < v < curr.val (including the re-traversal time caused by abort and restart); for our 
algorithm, we also include the time of the validate function. The update time is measured from just before locking 
to just after lock release; for the Harris-Michael algorithm, which does not use locks, we simply measured the time 
taken by the CAS at the end of each update operation (excluding CASes that happen during list traversal). 
Table 1 shows the relative execution time per operation per thread, normalized by the lowest number among 
the data (shaded cell). 

We can see that our new list algorithm outperforms both Harris-Michael’s and the Lazy Linked List algorithms 
and remains scalable even under extremely heavy contention (the top-right corner of the throughput 
graphs). The only place where our algorithm drops in performance is when there are more threads than cores 
(above 64 threads, “core-saturation”) and contention is high. This problem is inherent to all lock-based 
algorithms: a thread holding a lock gets preempted from the CPU, while any other thread contending on the 
same lock cannot make progress even if it is assigned CPU time. (We can see from Table 1 that at 72 
threads our update in the critical section took more than 100% longer than at 64 threads: 162.6 versus 16.2.) 

Comparison against Harris-Michael. Harris-Michael’s algorithm in general scales well and performs really 
well under high contention and core saturation (at 72 threads). This can be explained by the fact that the 
algorithm is nonblocking: a thread preempted from the CPU at any time does not indefinitely hamper the 
progress of other threads. 

We can see, however, on the left-hand side of Figures 5 and 6 that even though the three algorithms 
feature the wait-free contains algorithm, our implementation of Harris-Michael’s contains is slower than the 
other two. The reason is the extra indirection needed when reading the next pointer in the combined 
pointer-plus-boolean structure. Note that the original C-like pseudocode of Harris [5] suggested the architecture-specific 
use of a bitmask on x86. While we could have done so in Java using sun.misc.Unsafe, this is not recommended as 
it may compromise the portability of the implementation. To avoid the overhead of reading an extra field when 
fetching the Java AtomicMarkableReference, we implemented the run-time type identification (RTTI) variant 
with two subclasses that inherit from a parent node class and that represent the marked and unmarked states 
of the node, as previously suggested by Heller et al. This optimization requires, on the one hand, that a remove casts the 
subclass instance to the parent class to create a corresponding node in the marked state. It allows, on the 
other hand, the traversal to check the mark of each node simply by invoking instanceof on it to check 
the subclass the node instantiates. 
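
To make the idea of encoding the mark in the run-time type concrete, here is one possible way to realize it. This is an assumption-laden sketch, not the Synchrobench code: the class RttiMarkSketch and its members are hypothetical, and the choice to let the marked instance share the next cell of the node it is created from is ours.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: the deletion mark is encoded by the subclass of the successor object.
final class RttiMarkSketch {
    // Parent node class: a value and a cell holding the successor.
    static abstract class Node {
        final int val;
        final AtomicReference<Node> next;
        Node(int val, AtomicReference<Node> next) { this.val = val; this.next = next; }
    }

    // Ordinary (unmarked) node.
    static final class Unmarked extends Node {
        Unmarked(int val, Node succ) { super(val, new AtomicReference<>(succ)); }
    }

    // Marked counterpart of an existing node: same value and same next cell,
    // so only its run-time type differs from the original.
    static final class Marked extends Node {
        Marked(Node original) { super(original.val, original.next); }
    }

    // A node is logically deleted when its next cell holds a Marked instance;
    // a single read plus instanceof replaces the pointer-plus-boolean pair.
    static boolean isDeleted(Node n) {
        return n.next.get() instanceof Marked;
    }

    // Logical deletion of n: swing n.next from its successor to the marked counterpart.
    // Assumes n is not the tail sentinel, whose successor is null.
    static boolean markDeleted(Node n) {
        Node succ = n.next.get();
        if (succ instanceof Marked) return false;          // already logically deleted
        return n.next.compareAndSet(succ, new Marked(succ));
    }
}
```

A CAS that still expects the original successor in n.next fails once n is marked, which mirrors the role of the mark bit in the pointer-plus-boolean representation.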

From Table 1, we can see that Harris-Michael’s algorithm has the most efficient updates because it only uses 
CAS; however, it spends much longer on list traversal. We also found that above 40 threads, around 5% 
of the traversal time under 100% update workloads is spent on attempts to unlink marked nodes during 
the traversal. 

Comparison against the Lazy Linked List. The Lazy Linked List has almost the same performance as 
our algorithm under low contention (the left column and the bottom row in the graphs) because both share the 
same wait-free list traversal with zero overhead (as the sequential code does) and for the updates, when there 
is no interference from concurrent operations, the difference between our pre-locking-validation and Heller’s 
post-locking-validation becomes negligible. 

The difference grows, however, as contention appears. The performance of the Lazy Linked List drops 
significantly due to its intense lock competition (as briefly explained in the introduction). By contrast, there are 
several features in our implementation that reduce the amount of contention on the locks significantly. For 
example, the pre-locking-validation that uses the versioned try-lock avoids aborting once the lock is acquired: if 
we cannot acquire the try-lock immediately because either the version has changed (meaning some concurrent 
thread has just modified the node) or the node is already locked (which means the version is going to change 
when it is unlocked), then we can already restart and try to read the new version. Another feature is that 
our insert and remove operations check the node value before locking, so in case the update fails (because the 
value is present or absent), it returns with no particular overhead compared to contains. Table 1 shows the 
tremendous increase in execution time for the Lazy Linked List because of the contention on locks. 
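
The pre-locking validation described above can be sketched with java.util.concurrent.locks.StampedLock, whose optimistic-read stamp plays the role of the version. The class VersionedTryLockSketch and the method tryInsertAfter are hypothetical names, and this is an illustration of the idea under our own assumptions rather than the versioned lock of the evaluated implementation.

```java
import java.util.concurrent.locks.StampedLock;

// Hypothetical sketch of "validate first, then try-lock against the version read".
final class VersionedTryLockSketch {
    static final class Node {
        final int val;
        volatile Node next;
        volatile boolean deleted;
        final StampedLock lock = new StampedLock();
        Node(int val, Node next) { this.val = val; this.next = next; }
    }

    // Tries to link a new node with value v between prev and curr.
    // Returns false whenever the caller should restart its traversal.
    static boolean tryInsertAfter(Node prev, Node curr, int v) {
        // Read the version; 0 means prev is currently write-locked.
        long version = prev.lock.tryOptimisticRead();
        if (version == 0L) return false;

        // Pre-locking validation: prev must be live and still point to curr.
        if (prev.deleted || prev.next != curr) return false;

        // Try-lock: succeeds only if the version is unchanged and the lock is free.
        long stamp = prev.lock.tryConvertToWriteLock(version);
        if (stamp == 0L) return false;
        try {
            prev.next = new Node(v, curr);   // no further validation needed under the lock
            return true;
        } finally {
            prev.lock.unlockWrite(stamp);    // releasing the lock changes the version
        }
    }
}
```

Because the lock is only taken when the version read before validation is still current, a failed attempt never holds the lock, which is the property the paragraph above relies on to reduce lock contention.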

6 Related work 


List-based sets. Heller et al. [5] proposed the Lazy Linked List algorithm, with a variety of optimizations. In 
particular, they mentioned doing a validation prior to locking, and using a single lock within an insert operation. 
One of the reasons why our implementation is faster than the Lazy Linked List is the use of a new versioned 
try-lock mechanism (hinted in [TS] for the TM context) that allows validating before acquiring the lock. 

Harris [8] proposed a non-blocking linked list algorithm that splits the removal of a node into two atomic 
steps: a logical deletion that marks the node and a physical removal that unlinks the node from the structure. 
Michael [15] proposed advanced memory reclamation algorithms for Harris’ algorithm. In our implementation, 
we rely on Java’s garbage collector for memory reclamation. 

For a comprehensive survey of list-based sets, we refer to the textbook of Herlihy and Shavit [T2] . 

Concurrency metrics. Sets of accepted schedules are commonly used as a metric of concurrency provided 
by a shared-memory implementation. For static database transactions, Kung and Papadimitriou use the 
metric to capture the parallelism of a locking scheme. While acknowledging that the metric is theoretical, 
they insist that it may have “practical significance as well, if the schedulers in question have relatively small 
scheduling times as compared with waiting and execution times.” 

Herlihy employed the same metric to compare various optimistic and pessimistic synchronization 
techniques using commutativity of operations constituting high-level transactions. A synchronization technique 
is implicitly considered there as highly concurrent, namely “optimal”, if no other technique accepts more 
schedules. By contrast, we focus here on a dynamic model where the scheduler cannot use prior knowledge 
of all the shared addresses to be accessed. Optimal concurrency can thus be seen as a variant of permissiveness, 
originally defined for opaque TM [5], applied to the case of dynamic data structures with high-level sequential 
semantics. 

In the TM context, Gramoli et al. [5] defined a concurrency metric, the input acceptance, as the ratio of 
committed transactions over aborted transactions for a given schedule. Unlike our metric, input acceptance 
does not apply to lock-based programs. 

7 Conclusion 

Intuitively, the ability of an implementation to successfully process interleaving steps of concurrent threads is 
an appealing property that should be met by performance gains. In this paper, we support this intuition by 
presenting a concurrency-optimal list-based set that outperforms (less concurrent) state-of-the-art algorithms. 
Does the claim also hold for other data structures? We suspect so. For example, similar but more general 
data structures, such as skip-lists or tree-based dictionaries, may allow for optimizations similar to the ones 
proposed in this paper. 

A Sequential implementation of the set type 


 1  Shared variables:
 2    Initially head, tail,
 3    head.val = −∞, tail.val = +∞
 4    head.next = tail

 5  insert(v):
 6    prev ← head                             ▷ copy the address
 7    curr ← read(prev.next)                  ▷ fetch the next element
 8    while (tval ← read(curr.val)) < v do
 9      prev ← curr
10      curr ← read(curr.next)                ▷ fetch from memory
11    if tval ≠ v then                        ▷ tval is stored locally
12      X ← new-node(v, prev.next)            ▷ v and address of curr
13      write(prev.next, X)                   ▷ next points to the new element
14    return (tval ≠ v)

15  remove(v):
16    prev ← head                             ▷ copy the address
17    curr ← read(prev.next)                  ▷ fetch next field
18    while (tval ← read(curr.val)) < v do    ▷ val local copy
19      prev ← curr
20      curr ← read(curr.next)
21    if tval = v then
22      tnext ← read(curr.next)               ▷ fetch the node after curr
23      write(prev.next, tnext)               ▷ delete the node
24    return (tval = v)

25  contains(v):
26    prev ← head
27    curr ← read(prev.next)
28    while (tval ← read(curr.val)) < v do
29      curr ← read(curr.next)
30    return (tval = v)

Algorithm 3: Sequential implementation LL (sorted linked list) of set type 


An object of the set type stores a set of integer values, initially empty, and exports operations insert($v$), 
remove($v$), contains($v$); $v \in \mathbb{Z}$. The update operations, insert($v$) and remove($v$), return a boolean response, true 
if and only if $v$ is absent (for insert($v$)) or present (for remove($v$)) in the list. After insert($v$) is complete, $v$ is 
present in the list, and after remove($v$) is complete, $v$ is absent from the list. The contains($v$) operation returns 
a boolean, true if and only if $v$ is present in the list. 

The sequential implementation LL of the set type is presented in Algorithm 3. The implementation uses 
a sorted linked list data structure in which each element (except the tail) maintains a next field to provide a 
pointer to the successor node. Initially, the next field of the head element points to tail; head (resp. tail) is 
initialized with value $-\infty$ (resp. $+\infty$), which is smaller (resp. greater) than the value of any other element in 
the list. 
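
For readers who prefer executable code, the sequential implementation LL translates almost line by line into Java. The sketch below is a transcription under assumptions: the class name SequentialListSet is hypothetical, and the sentinels $-\infty$ and $+\infty$ are approximated by Integer.MIN_VALUE and Integer.MAX_VALUE, so stored values must lie strictly between them.

```java
// Hypothetical Java transcription of the sequential sorted linked list LL.
final class SequentialListSet {
    private static final class Node {
        final int val;
        Node next;
        Node(int val, Node next) { this.val = val; this.next = next; }
    }

    // Sentinels standing in for -infinity and +infinity.
    private final Node tail = new Node(Integer.MAX_VALUE, null);
    private final Node head = new Node(Integer.MIN_VALUE, tail);

    boolean insert(int v) {
        Node prev = head;
        Node curr = prev.next;
        int tval;
        while ((tval = curr.val) < v) {     // advance until curr.val >= v
            prev = curr;
            curr = curr.next;
        }
        if (tval != v) {
            prev.next = new Node(v, curr);  // link the new node between prev and curr
        }
        return tval != v;                   // true iff v was absent
    }

    boolean remove(int v) {
        Node prev = head;
        Node curr = prev.next;
        int tval;
        while ((tval = curr.val) < v) {
            prev = curr;
            curr = curr.next;
        }
        if (tval == v) {
            prev.next = curr.next;          // unlink curr
        }
        return tval == v;                   // true iff v was present
    }

    boolean contains(int v) {
        Node curr = head.next;
        int tval;
        while ((tval = curr.val) < v) {
            curr = curr.next;
        }
        return tval == v;
    }
}
```

As in the pseudocode, each update performs at most one write to a next field, which is the write that the concurrent schedules discussed in the paper must not lose.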


B Proof of Theorem 1 

Theorem 1 (Correctness). Versioned List is linearizable with respect to the set type. 

Proof. We show that S defined in Section is consistent with the sequential specification of type set. When 
we refer to read(X), where X is a node, we mean the first read of a node’s field. 

Let $S^k$ be the prefix of $S$ consisting of the first $k$ complete operations. We associate each $S^k$ with a set $q^k$ 
of objects that were successfully inserted and not subsequently successfully removed in $S^k$. We show by 
induction on $k$ that the sequence of state transitions in $S^k$ is consistent with operations’ responses in $S^k$ with 
respect to the set type. 

The base case $k = 1$ is trivial: the tail node containing $+\infty$ is successfully inserted. Suppose that $S^k$ is 
consistent with the set type and let $\pi_1$ with argument $v \in \mathbb{Z}$ and response $r_{\pi_1}$ be the last operation of $S^{k+1}$. 
We want to show that $(q^k, \pi_1, r_{\pi_1})$ is consistent with the set type. 

(1) Let $\pi_1 = \mathit{insert}(v)$ return true in $S^{k+1}$. We show below that each preceding $\pi_2 = \mathit{insert}(v)$ returning true is 
followed by $\mathit{remove}(v)$ returning true, such that $\pi_2 \rightarrow_{S^{k+1}} \mathit{remove}(v) \rightarrow_{S^{k+1}} \pi_1$. Suppose the opposite. Observe 



contradiction. 

Let $\pi_1 = \mathit{insert}(v)$ return false in $S^{k+1}$. We show that there exists a preceding $\pi_2 = \mathit{insert}(v)$ returning true 
that is not followed by $\pi_3 = \mathit{remove}(v)$ returning true, such that $\pi_2 \rightarrow_{S^{k+1}} \pi_3 \rightarrow_{S^{k+1}} \pi_1$. Suppose that such 
a $\pi_2$ does not exist. Thus, $\pi_1$ must perform its last read on a node $X$ that stores value $v' > v$, acquire the 
versioned lock on $X$ (Line 44) and return true, a contradiction to the assumption that $\pi_1$ returned false. 

It is easy to verify that the conjunction of the above two claims proves that $\forall q \in Q$; $\forall v \in \mathbb{Z}$, $S^{k+1}$ satisfies 
$(q, \mathit{insert}(v), q \cup \{v\}, (v \notin q))$. 

(2) If $\pi_1 = \mathit{remove}(v)$, similar arguments as applied to $\mathit{insert}(v)$ prove that $\forall q \in Q$; $\forall v \in \mathbb{Z}$, $S^{k+1}$ satisfies 
$(q, \mathit{remove}(v), q \setminus \{v\}, (v \in q))$. 

(3) Let $\pi_1 = \mathit{contains}(v)$ return true in $S^{k+1}$. We show that there exists $\pi_2 = \mathit{insert}(v)$ returning true that is not 
followed by any $\mathit{remove}(v)$ returning true, such that $\pi_2 \rightarrow_{S^{k+1}} \mathit{remove}(v) \rightarrow_{S^{k+1}} \pi_1$. Recall that $\pi_1$ is linearized 
at the last read of a node, say $X$, performed by $\pi_1$ when $\pi_1$ reads the deleted field of $X$ to be false (Line 18). By 
the algorithm, there exists $\pi_2 = \mathit{insert}(v)$ such that $\pi_2 \rightarrow_{S^{k+1}} \pi_1$ (let $\pi_2$ be the latest such operation). Suppose 
that there exists a $\mathit{remove}(v)$ that returns true, such that $\pi_2 \rightarrow_{S^{k+1}} \mathit{remove}(v) \rightarrow_{S^{k+1}} \pi_1$. Thus, $\mathit{remove}(v)$ 
performs the write event in Line 56 prior to the read of $X.\mathit{deleted}$ by $\pi_1$. But then $\pi_1$ must read $X.\mathit{deleted}$ to 
be true and return false, a contradiction. 

Now, let $\pi_1 = \mathit{contains}(v)$ return false in $S^{k+1}$. Thus, (1) there exists a $\pi_2 = \mathit{remove}(v)$ returning true that 
is not followed by any $\mathit{insert}(v)$ returning true, such that $\pi_2 \rightarrow_{S^{k+1}} \mathit{insert}(v) \rightarrow_{S^{k+1}} \pi_1$, or (2) there does not 
exist any $\mathit{insert}(v)$ returning true such that $\mathit{insert}(v) \rightarrow_{S^{k+1}} \pi_1$. We consider two cases: 


- Suppose that $\pi_1$ reads $(X.\mathit{value} \neq v)$ in Line 18, where $X$ is the last node read by $\pi_1$ in $\sigma$. Thus, $\ell_{\pi_1}$ is 
assigned to the read of the next field of the node, say $X'$, accessed by $\pi_1$ immediately before $X$. Assume by 
contradiction that there exists $\pi_2 = \mathit{insert}(v)$ that returns true such that there does not exist any $\mathit{remove}(v)$ 
that returns true; $\pi_2 \rightarrow_{S^{k+1}} \mathit{remove}(v) \rightarrow_{S^{k+1}} \mathit{contains}(v)$. But then $\pi_1$ must read $(X.\mathit{value} = v)$ 
and return true, a contradiction. 
- Suppose that $\pi_1$ reads $(X.\mathit{value} = v)$ and $X.\mathit{deleted}$ to be true in Line 18. Clearly, there exists a $\pi_2 = 
\mathit{remove}(v)$ that is concurrent to $\pi_1$ and returns true in $H$. By the assignment of linearization points, $\ell_{\pi_1}$ is 
assigned to the first event performed by $\pi_2$ immediately after the write to $X.\mathit{deleted}$, but prior to the read 
of $X.\mathit{deleted}$ by $\pi_1$, where $X$ is the last node read by $\pi_1$. We consider two cases: (1) Suppose that some 
such event of $\pi_2$ exists. We claim that there does not exist any $\pi_3 = \mathit{insert}(v)$ that returns true such that 
$\pi_2 \rightarrow_{S^{k+1}} \pi_3 \rightarrow_{S^{k+1}} \pi_1$. Any such $\pi_3$ must acquire the versioned lock on $X'$ (Line 44), the node read by $\pi_1$ 
immediately prior to $X$. Since $\pi_1$ reads $(X.\mathit{value} = v)$ and $X.\mathit{deleted}$ to be true, $\pi_2$ must also acquire the 
versioned lock on $X'$ (Line 54). By our assumption, $\pi_2 \rightarrow_{S^{k+1}} \pi_3$. Thus, $\pi_3$ acquires the versioned lock 
on $X'$ only after $\pi_2$ releases it in Line 59. But we linearize $\pi_1$ prior to this release by choosing $\ell_{\pi_1}$ to be the event 
performed by $\pi_2$ in Line 57, a contradiction to our assumption that $\pi_3$ precedes $\pi_1$ in $S^{k+1}$. (2) Otherwise, if no such 
event of $\pi_2$ exists, $\ell_{\pi_1}$ is chosen as the read of $X.\mathit{deleted}$ by $\pi_1$. Since $\pi_2$ does not release the versioned lock 
on $X'$ prior to the read of $X.\mathit{deleted}$ by $\pi_1$, there does not exist any $\mathit{insert}(v)$ that returns true such that 
$\mathit{insert}(v) \rightarrow_{S^{k+1}} \pi_1$. Now, by the assignment of linearization points, $\pi_2 \rightarrow_{S^{k+1}} \pi_1$. 


Thus, inductively, the sequence of state transitions in S satisfies the sequential specification of the set type. □ 